Supervised Learning Classification Project: AllLife Bank Personal Loan Campaign¶

Completed by : Sujit Tilakraj Thakur¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)

Before starting with our analysis and model , I (Sujit) will like to provide problem defination as it was asked in the Scoring Rubric¶

Problem Defination :

AllLife Bank aims to enhance its loan business by converting liability customers into personal loan customers while retaining them as depositors.

I (Sujit) as a Data Scientist, the objective is to develop a predictive model to identify potential customers likely to purchase personal loans. The goal is to understand key customer attributes driving purchases, providing actionable insights for the marketing department. The model's success will optimize target marketing strategies, increase the personal loan conversion rate, and identify specific customer segments for targeted campaigns.

Importing necessary libraries¶

In [1]:
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# Library to split data
from sklearn.model_selection import train_test_split

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)


# To build model for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)


## To get confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

Loading the dataset¶

In [2]:
##  using read_csv  from pandas library to read our data 
data = pd.read_csv("C:/Users/sthakur4/OneDrive - Biogen/Documents/PGP MLAI/Machine Learning/project/Loan_Modelling.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [3]:
## making copy of our data to keep orignal data intact  using .copy()
df = data.copy() 
In [4]:
## Lets check first 5 rows of the data  using .head()
df.head() 
Out[4]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [5]:
## Lets take a look at last 5 rows using .tail()
df.tail()
Out[5]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
In [6]:
## Lets check shape of data 
df.shape
Out[6]:
(5000, 14)

we can see from above shape , head and tail commands that we have data with 5000 rows and 14 columns¶

¶

In [7]:
## Lets check data types of column 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

We can see from above info() , that all our columns are non null and they are all numeric in nature¶

In [8]:
## Lets look at statistical summary of the dataset using describe()b
df.describe().T
Out[8]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

We can see from above summary that,

A)ID:

  • The dataset contains 5,000 records with IDs ranging from 1 to 5,000.

B)Age:

  • Customers in the dataset have an average age of approximately 45 years, with a minimum age of 23 and a maximum age of 67.

  • The age distribution appears to be relatively symmetrical, with the median (50th percentile) and mean close in value. but still mean - median = 45.3 - 45 = 0.33 , That means it is slightly right skewed.

C) Experience:

  • The average professional experience of customers is around 20 years, with a minimum reported experience of -3 years (which might be an anomaly or data error).

  • Experience ranges from -3 to 43 years, indicating a diverse range of experience levels. It also shows that experience shows -3 as minimum that means there is some discrepancy in data , hence so lets clean our data while preprocessing

D) Income:

  • The average income is 73.7 dollars , with a minimum income of 8 dollars and a maximum of 224 dollars.

  • Income distribution is positively skewed, as the mean is higher than the median.

E) ZIPCode:

  • ZIP codes is currently saved as numerical ,which in preprocessing we will convert in category format

F) Family:

  • The average number of family members is approximately 2.4, with a range from 1 to 4.

  • The distribution is slightly positively skewed, with more customers having smaller families.

G) CCAvg:

  • The average spending on credit cards per month is 1.94 Dollars , ranging from 0 to 10 Dollars.

  • The distribution is positively skewed, indicating that most customers have lower credit card spending.

H)Education:

  • The majority of customers have an education level of 1 or 2, with a few having a level of 3.

  • Education levels are discrete and categorized.

I) Mortgage:

  • On average, customers have a mortgage of 56.50 Dollars , but the distribution is highly right skewed, with a wide range up to 635 Dollars . and mean being higher than median

J) Personal Loan:

  • The dataset is imbalanced for the "Personal Loan" which is our target variable, with only 9.6% of customers having taken a personal loan.

K) Securities Account, CD Account, Online, Credit Card:

  • These binary variables (0 or 1) indicate the presence of a securities account, CD account, online banking usage, and credit card ownership, respectively.
  • The proportions of customers with these attributes are relatively low, indicating that they are not common among the dataset.
In [9]:
## Lets see if there is any duplicate values
df[df.duplicated()].sum()
Out[9]:
ID                    0.0
Age                   0.0
Experience            0.0
Income                0.0
ZIPCode               0.0
Family                0.0
CCAvg                 0.0
Education             0.0
Mortgage              0.0
Personal_Loan         0.0
Securities_Account    0.0
CD_Account            0.0
Online                0.0
CreditCard            0.0
dtype: float64

We can see from above code that we dont have any duplicacy in our data

Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

Lets answer first above questions and then we will do some extra univariate and bivariate analysis¶

In [10]:
## Before answering questions and performing Univariate and Bivariate analysis lets write all plotting functions 
In [11]:
## lets make a function for plotting histogram and boxplot together for target variable 
def dist_with_target_plots (data , predictor , target):   ## initializing function
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))       ## figure size for proper visibility
    target_uniq = data[target].unique()                   ## getting unique value from target variable
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))            ## setting title for first histogram
    sns.histplot(                                               ## Plotting histogram
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0,1].set_title("Distrinution of target for target =" + str(target_uniq[1]))      ## setting title for second histogram
    sns.histplot(
        data = data[data[target]==target_uniq[1]],                    ## Plotting histogram
            x= predictor,
            kde=True,
            ax =axs[0,1],
            color="teal",
            stat="density",
        )
    
    axs[1, 0].set_title("Boxplot w.r.t target")                      ## Plotting first boxplot with outliers
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")   

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")    ## Plotting second boxplot wi
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [12]:
## lets make a function for plotting proportion graph
def dist_with_target2(data, predictor, target):                                  ## initializing function
                                                                # Create a new DataFrame with the selected columns 
    dt = data[[predictor, target]].copy()

                                                                                        # Create age groups 
    bins = [20, 25, 30, 35, 40, 45, 50, 55, 60 , 65]  
    labels = ['20-25','25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-60','60-65']

    dt[predictor] = pd.cut(df[predictor], bins=bins, labels=labels, right=False)

                                                                        # Calculate the distribution of the target variable for each age group
    dist_table = pd.crosstab(index=df[predictor], columns=df[target], normalize='index')

                                                                                    # Plot the distribution
    plt.figure(figsize=(10, 6))
    sns.barplot(x=dist_table.index, y=dist_table[1], color='skyblue', label='Bought Loan')
    sns.barplot(x=dist_table.index, y=dist_table[0], color='lightcoral', bottom=dist_table[1], label="Didn't Buy Loan")

    plt.title("Loan Purchase Distribution by Age Group")   ## setting title
    plt.xlabel("Age Group")                                ## changing x label
    plt.ylabel("Proportion")                               ## changing y label
    plt.legend(title=target)                               ## plotting legend                          
In [13]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):        ## initializing function
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])                                          # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
                                                                                      
    plt.xticks(rotation=90, fontsize=15)                                       
    ax = sns.countplot(                                              ## plotting count plot
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )                                              # percentage of each class of the category
        else:
            label = p.get_height()                            # count of each level of the category

        x = p.get_x() + p.get_width() / 2                        # width of the plot
        y = p.get_height()                                    # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [14]:
## Writing function for a plot which will plot histogram and box plot 

def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):          ## initializing function

    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,                                       # Number of rows of the subplot grid= 2
        sharex=True,                                   # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )                                                   # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )                                              # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )                                                               # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )                                                                      # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )                                                                       # Add median to the histogram
In [15]:
## Writing a function for stacked barplot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()                           ### unique from predictor
    sorter = data[target].value_counts().index[-1]              ### counting values of target
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(          ## making a cross tab of all values
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))      ## plotting bar graph
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [16]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):              ## initilizing functions

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))                    ## setting size for better visibility

    target_uniq = data[target].unique()                                    ## unique from target
 
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))    ## setting title
    sns.histplot(
        data=data[data[target] == target_uniq[0]],                                 ## plotting histogram
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,                                                                   ## plotting histogram
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")   ## box plot with outliers

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,                                                                         ## box plot without outliers
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Q1 ) What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?¶

In [17]:
# Create subplots 
fig, axes = plt.subplots(1, 2, figsize=(16, 6)) ## making figure size for better visibility

# Plot 1: histogram

sns.histplot(x="Mortgage", data = df, ax=axes[0],kde=True) ## plotting histogram with kde 
axes[0].set_title('Histogram: Cuisine Type vs Cost of Order'); ## Setting Title of plot

# Plot 2: Boxplot

sns.set_style("whitegrid");  ## changing grid style
sns.boxplot(data=df, x = "Mortgage", ax=axes[1]) ## plotting boxplot
axes[1].set_title('Boxplot: Cuisine Type vs Cost of Order'); ## Setting Title of plot
In [18]:
Q1 = df["Mortgage"].quantile(0.25)  # Complete the code to find the 25th percentile and 75th percentile.
Q3 = df["Mortgage"].quantile(0.35)  # Complete the code to find the 75th percentile and 75th percentile.

IQR = Q3 - Q1               # Inter Quantile Range (75th perentile - 25th percentile)

lower_bound = Q1 - 1.5 * IQR  # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_bound = Q3 + 1.5 * IQR

outliers_count = ((df["Mortgage"] < lower_bound) | (df["Mortgage"] > upper_bound)).sum()  
                             ## getting count of outliers
percentage_outliers = (outliers_count / len(data)) * 100      ## calculating percentage of outliers 
print(f"Percentage of outliers in Mortgage column is: {percentage_outliers:.2f}%")
Percentage of outliers in Mortgage column is: 30.76%

Answer for Q1:¶

- We can see from above histogram with KDE on left side that , "Mortgage" attirbute is right skewed. And when we take a look on box plot which is presented on Right hand side , it shows that mortgage has alot of outliers¶

- We can see from the left side plotted histogram that maximum data of mortage lies around zero , and then data is distributed in small quantity from 100 to 600¶

- We have also calculated that almost 30.76% of data in mortgage is outlier¶

Q2) How many customers have credit cards?¶

Now this Answer can be solved in three parts¶

  • A) Customers with CreditCard from any other bank (excluding All Life Bank)

From "CreditCard" Column , If it says "1" that means customer has a credit card from some other bank than All Life Bank

  • B) Customer with Credit card from All life Bank

That means Customer which have atleast some spending from Credit Card (from CC Average column) but has "0" in Credit Card column that means doesnt have any other banks credit card that means it has our banks (All Life banks) credit card

  • C) Customer with Credit card of either one (Summation of above two )

Part A) Customers with CreditCard from any other bank (excluding All Life Bank)¶

In [19]:
## Getting Answer for Part A
Customer_CreditCard = df[df["CreditCard"]==1] ## making a subset of data where creditcard is 1
otherBank =Customer_CreditCard["ID"].value_counts().sum()
print("Hence ",Customer_CreditCard["ID"].value_counts().sum() ,"are total number of customers who owns credit card from any bank other than All Life Bank" ) ## Printing answer
Hence  1470 are total number of customers who owns credit card from any bank other than All Life Bank

Answer For Part 2-A)¶

Hence 1470 are total number of customers who owns credit card from any bank other than All Life Bank¶

B) Getting Answer for Part B - Where Customer owns credit cards from All Life Bank¶

In [20]:
## Getting Answer for Part B 
AlllifeCreditcard = df[(df["CreditCard"] == 0) & (df["CCAvg"] > 0)] ## making subset where Creditcard is 0 and ccavg is greater than 0
All_life=AlllifeCreditcard["ID"].value_counts().sum()
print("Hence ",AlllifeCreditcard["ID"].value_counts().sum() ,"are total number of customers who owns credit card from a All Life Bank" ) ## Pri
Hence  3452 are total number of customers who owns credit card from a All Life Bank

Answer For Part 2-B)¶

Hence 3452 are total number of customers who owns credit card from All Life Bank¶

C) All Customers with Credit Cards¶

In [21]:
## Getting Answer for Part C 
All_cus = All_life+ otherBank
print("All Customers with Credit card {}".format(All_cus))
All Customers with Credit card 4922

Answer For Part 2-C)¶

Hence all Customers with Credit card are total 4922¶

Conclusion for Q2 :¶

  • #### Customer with Credit Card from any bank other than All Life Bank are 1470 Customers

  • #### Customer with Credit Card from All Life Bank are 3452 Customers

  • #### Customer with Credit Card from All life or any other bank are 4492 Customers

Q3) What are the attributes that have a strong correlation with the target attribute (personal loan)?¶

In [22]:
mat = df.corr() ## Calculating correlation of our data and storing it in variable mat
plt.figure(figsize=(12,14)); ## Increasing figure size for proper visibility
sns.heatmap(mat,annot=True, fmt=".2f",annot_kws={"size": 10}); ## olotting heat map of same 
plt.title("Correlation HeatMap")
Out[22]:
Text(0.5, 1.0, 'Correlation HeatMap')

Answer¶

The attribute that has the strongest relation with target attribute called "Personal Loan" is Income with a correlation of 0.5¶

Q4) How does a customer's interest in purchasing a loan vary with their age?¶

In [23]:
dist_with_target_plots(df, 'Age', 'Personal_Loan') ### plotting distribution with target function plot of age and personalloan
In [24]:
dist_with_target2(df,'Age', 'Personal_Loan') ###Plotting proportional plot using manual made fucntion for age vs personal loan

Answer:¶

Loan Purchase distribution plot (second plot) :¶

  • We can see that the age group starts from 23 and ends at 67. And it is clear from the second plot of proportion that people of age 23 to 25 and of age 66 to 67 have no interest in purchasing loan
  • We can also see that people those are intrested in purchasing loan lies between age group of 26 to 65. In which people of age 26, 36 and 65 are the three top age of people who are highly intrested in purchasing loan

Loan Purchase vs Age Histogram and Box plot (First plot) :¶

  • As we can see from above plots that for a person to purchase a loan , the interest increases from age 26 till 36 years of age . And then it starts falling from towards 40 years of age and become approximately plateaued towards age of 55 and then it drops
  • While for people who didnt purchase loan , the kde rises till 30 and gets plateaued while reaching 60 and then drops

Q5) How does a customer's interest in purchasing a loan vary with their Education?¶

In [25]:
stacked_barplot(data, "Education", "Personal_Loan")  ## plot stacked bar plot education and personal loan
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
In [26]:
print("For Level 3 ,{}% of people purchased loan".format((205/1501)*100))  ## printing calculation
print("For Level 2 ,{}% of people purchased loan".format((182/1403)*100))
print("For Level 1 ,{}% of people purchased loan".format((93/2096)*100))
For Level 3 ,13.657561625582945% of people purchased loan
For Level 2 ,12.972202423378477% of people purchased loan
For Level 1 ,4.437022900763359% of people purchased loan

Answer :¶

  • we can see that people with higher education (i.e 3 - Professional / Advanced) has more interest in buying loan , and this interest of purchasing reduces with the education level
  • for level 3 , 13.65% of people has intrest and purchased loan
  • for level 2 , 12.97% of people has intrest and purchased loan
  • for level 1 , 4.43% of people has intrest and purchased loan

Extra Exploratory Data Analysis¶

Age :¶

In [27]:
histogram_boxplot(data, "Age") ## Plotting the histogram box plot for Age
In [28]:
print("Max age is {} and Min age is {} with median age of {}".format((df["Age"].max()),(df["Age"].min()),df["Age"].median()))
print("Mean age is {}".format(df["Age"].mean()))        ## printing values
Max age is 67 and Min age is 23 with median age of 45.0
Mean age is 45.3384

Observation :¶

  • We can see from above plot that Age is slightly right skewed as the mean is greater than the median
  • Max age is 67 and Min age is 23 with median age of 45.0 and mean of 45.33

Experience¶

In [29]:
histogram_boxplot(df,"Experience")  ## Plotting the histogram box plot for Experience

Observation for experience¶

  • From above plot we can see that experience column is almost normally distributed , with mean and median being the nearly same.

Income¶

In [30]:
histogram_boxplot(df,"Income")  ## Plotting the histogram box plot for Income

Observation on Income:¶

  • we can see from above plots that income is right skewed distribution with mean being more than median
  • We can see from box plot that we do have some outliers in income

CCAvg¶

In [31]:
histogram_boxplot(df,'CCAvg')  ##  create histogram_boxplot for CCAvg

Observation on CCAvg:¶

  • we can see from above plots that CCAvg is right skewed distribution with mean being more than median
  • We can see from box plot that we do have some outliers

Mortgage¶

In [32]:
histogram_boxplot(df,"Mortgage")  ## Complete the code to create histogram_boxplot for Mortgage

Observation on Mortgage¶

  • We can see from above plot that Mortgage is highly right skewed
  • A lot of data has value around "0" and then data starts increasing around 100 to 600 . Making it a right skewed distribution

Family¶

In [33]:
histogram_boxplot(df,"Family")  ##  create histogram_boxplot for Family

Observation on Family:¶

  • We can see from above plot that Mortgage is highly right skewed that is mean is more than median
  • Highest number of family is 1 and smallest entry if 3

Education¶

In [34]:
labeled_barplot(df,"Education")   ## Create labeled_barplot for Education

Observation on Education¶

We can see from above labeled bar plot that we have three different categories in education ,

  • "1" being undergrad
  • "2" being Masters
  • "3" being advanced / professional

We can see that our data has highest people with undergrad and lowest with masters

Securities Account¶

In [35]:
labeled_barplot(df,"Securities_Account")   ## create labeled_barplot for security account

Observation on Securities_Account¶

  • We can see that this feature has two values,
    • "0" for people Not having a securities account
    • "1" for people having a securities account
  • Maximum number of people in our dataset doesnot own a securities account , while 522 out of 5000 people own a securities account

CD_Account¶

In [36]:
labeled_barplot(df,"CD_Account")   ## create labeled_barplot for CD_Account

Observation on CD_Account¶

  • We can see that this feature has two values,
    • "0" for people Not having a CD account
    • "1" for people having a CD account
  • Maximum number of people in our dataset doesnot own a CD account , while 302 out of 5000 people own a CD account

Online¶

In [37]:
labeled_barplot(df,"Online")   ## create labeled_barplot for Online

Observation on Online¶

  • We can see that this feature has two values,
    • "0" for people Not using online banking
    • "1" for people using online banking
  • Maximum number of people in our dataset did use online banking , which accounts for 2984 out of 5000 people

Credit Card¶

In [38]:
labeled_barplot(df,"CreditCard")   ## create labeled_barplot for credit card

Observation on Credit_Card¶

  • We can see that this feature has two values,
    • "0" for people Not having a credit card
    • "1" for people having a credit card
  • Maximum number of people in our dataset doesnot own a Credit card , while 1470 out of 5000 people own a credit card

ZIPCode¶

In [39]:
df["ZIPCode"].nunique()  ## getting number of unique values in Zipcode
Out[39]:
467

We can see ZIPCode has 467 unique entries and it is a categorical variable , even if it is represented by numbers . hence lets group our ZIPCode into groups and then perform univariate Analysis .¶

Lets make group of ZIP by grouping together all zipcodes which start with same first two numbers¶

In [40]:
df_zip_group = df.copy() ## making a copy to keep our main data intact 
df_zip_group["ZIPCode"] = df_zip_group["ZIPCode"].astype("str") ## making zipcode in string
df_zip_group["ZIPcode"] = df_zip_group["ZIPCode"]
In [41]:
df_zip_group["ZIPCode_group"] = df_zip_group["ZIPCode"].str[0:2]  ## using first two digits of zip to group them
df_zip_group["ZIPCode_group"] = df_zip_group["ZIPCode_group"].astype("category")
df_zip_group.head()  ## showing first 5 rows
Out[41]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard ZIPcode ZIPCode_group
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0 91107 91
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0 90089 90
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0 94720 94
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0 94112 94
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1 91330 91
In [42]:
df_zip_group["ZIPCode_group"].nunique() ## getting unique values from new zip code group 
Out[42]:
7

now we can see we have 7 groups of zip which we can use for univariate and bivariate analysis¶

Hence performing univariate analysis¶

In [43]:
plt.figure(figsize=(10,8))
ax = sns.countplot(x='ZIPCode_group', data=df_zip_group) ## Plotting a count plot

# Adding labels to the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

plt.title('Countplot of ZIPCode Groups');   ## changing title of plot

As we can see from the count plot above ,the most number of entris are form group 94 of ZIPCode , which are 1472¶

the lowest number of customer for our Alllife bank is from area of zipcode group 96 with only 40 entries¶

following is order of highest to lowest ,¶

  • Zip code group 94 with total size of 1472
  • Zip code group 92 with total size of 988
  • Zip code group 95 with total size of 815
  • Zip code group 90 with total size of 703
  • Zip code group 91 with total size of 565
  • Zip code group 93 with total size of 417
  • Zip code group 96 with total size of 40

Bivariate Analysis¶

Correlation¶

In [44]:
plt.figure(figsize=(15, 7))  ## setting size of visual for proper visibility
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") # Complete the code to get the heatmap of the data
plt.show()

Observation for correlation plot¶

  • We can see the highest correlation is Age and Experience with correlation of 0.99
  • For our target variable "Personal Loan" the feature with highest correlation is 0.50 with Income
  • other high correlation with personal loan is CC Avg and CD Account with personal loan of 0.37 and 0.32 respectively

Personal Loan Vs Family¶

In [45]:
stacked_barplot(df,"Family","Personal_Loan")  ## plot stacked barplot for Personal Loan and Family
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
In [46]:
print("For family of 4 , {} % of people purchased loan".format((134/1222)*100))
print("For family of 3 , {} % of people purchased loan".format((133/1010)*100))  ## printing calculation
print("For family of 2 , {} % of people purchased loan".format((106/1296)*100))
print("For family of 1 , {} % of people purchased loan".format((107/1472)*100))
For family of 4 , 10.965630114566286 % of people purchased loan
For family of 3 , 13.16831683168317 % of people purchased loan
For family of 2 , 8.179012345679013 % of people purchased loan
For family of 1 , 7.2690217391304355 % of people purchased loan

Observation on Family Vs Personal Loan¶

  • we can see people with family size of 3 had the highest percenatge of buying loan. It has 13.16% of people with family of 3 buying loan
  • while Family of 1, buys loan the least , with a percentage of 7.26%

Personal_Loan vs Securities_Account¶

In [47]:
stacked_barplot(df,"Securities_Account","Personal_Loan") ## plot stacked barplot for Personal Loan and Securities_Account
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
In [48]:
print("For People with Securities Account , {} % of people purchased loan".format((60/522)*100))  ## printing calculation
print("For People with no Securities Account , {} % of people purchased loan".format((420/4478)*100))
For People with Securities Account , 11.494252873563218 % of people purchased loan
For People with no Securities Account , 9.379187137114783 % of people purchased loan

Observation on Securities Account Vs Personal Loan¶

  • we can see people with people with securities account had the highest percenatge of buying loan. It has 11.49% of people buying loan
  • while Family without securities account only 9.37% bought personal loan.
  • It also should be considered that in our data very small percentage of people has securities account , that is 552 out of 5000 , and rest of them do not hold a security account

Personal_Loan vs CD_Account¶

In [49]:
stacked_barplot(df,"CD_Account","Personal_Loan") ##  plot stacked barplot for Personal Loan and CD_Account
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
In [50]:
print("For People with CD Account , {} % of people purchased loan".format((140/302)*100))  ## printing calculation
print("For People with no CD Account , {} % of people purchased loan".format((340/4698)*100))
For People with CD Account , 46.35761589403973 % of people purchased loan
For People with no CD Account , 7.237122179650915 % of people purchased loan

Observation on CD Account Vs Personal Loan¶

  • we can see people with people with CD Account had the highest percenatge of buying loan. It has 46.35% of people buying loan
  • while People without CD account only 7.23% bought personal loan.
  • It also should be considered that in our data very small percentage of people has CD account , that is 302 out of 5000 , and rest of them do not hold a CD account

Personal_Loan vs Online¶

In [51]:
stacked_barplot(df,"Online","Personal_Loan") ##  plot stacked barplot for Personal Loan and Online
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
In [52]:
print("For People with CD Account , {} % of people purchased loan".format((140/302)*100))  ## printing calculation
print("For People with Online Banking , {} % of people purchased loan".format((291/2984)*100))    
print("For People with no Online Banking , {} % of people purchased loan".format((189/2016)*100))
For People with CD Account , 46.35761589403973 % of people purchased loan
For People with Online Banking , 9.75201072386059 % of people purchased loan
For People with no Online Banking , 9.375 % of people purchased loan

Observation on Online Vs Personal Loan¶

  • we can see people with people with Online had the highest percenatge of buying loan. It has 9.75% of people buying loan
  • while People without Online banking only 9.35% bought personal loan.
  • We can see the numbers are almost alike , so we can here weakly conclude that online banking doesnt affect purchasing of personal loan

Personal_Loan vs CreditCard¶

In [53]:
stacked_barplot(df,"CreditCard","Personal_Loan") ##  plot stacked barplot for Personal Loan and CreditCard
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
In [54]:
print("For People with CD Account , {} % of people purchased loan".format((140/302)*100))  ## printing calculation
print("For People with Credit Card , {} % of people purchased loan".format((143/1470)*100))    
print("For People with no Credit Card , {} % of people purchased loan".format((337/3530)*100))
For People with CD Account , 46.35761589403973 % of people purchased loan
For People with Credit Card , 9.727891156462585 % of people purchased loan
For People with no Credit Card , 9.546742209631729 % of people purchased loan

Observation on Credit Card Vs Personal Loan¶

  • we can see people with people with Credit card had the highest percenatge of buying loan. It has 9.72% of people buying loan
  • while People without Credit Card banking only 9.54% bought personal loan.
  • It also should be considered that in our data very small percentage of people has credit card , that is 1470 out of 5000 , and rest of them do not hold a Credit card account

Personal Loan vs Experience¶

In [55]:
distribution_plot_wrt_target(df,"Experience","Personal_Loan") ##  plot stacked barplot for Personal Loan and Experience

Loan Purchase vs Experience distribution plot :¶

  • For People around experience of 5-10 years of experience are most intrested in purchasing loan

Personal Loan vs Income¶

In [56]:
distribution_plot_wrt_target(df,"Income","Personal_Loan") ##  plot stacked barplot for Personal Loan and Income

Observation for Personal Loan VS Income¶

  • We can see from above plots on left hand side that for distribution when personal loan is not bought and distribution is income vs personal loan , it is a right skewed distribution
  • We can see from above plots on right hand side that for distribution when personal loan is bought and distribution is income vs personal loan , it is a left skewed distribution
  • this does direct towards that if the income is high thne its higher intrest of people in purchasing loan

Personal Loan vs CCAvg¶

In [57]:
distribution_plot_wrt_target(df,"CCAvg","Personal_Loan") ## plot stacked barplot for Personal Loan and CCAvg

Observation for CCAvg VS Personal Loan¶

  • We can see from above plots on left hand side that for distribution when personal loan is not bought and distribution is CCV Avg vs personal loan , it is a heavily right skewed distribution
  • We can see from above plots on right hand side that for distribution when personal loan is bought and distribution is CCAvg vs personal loan , it is a not that heavily right skewed distribution when compared to left side plot where target is 0
  • We can also see from box plot with outliers that for data of personal loan being 0 , it has more outliers
  • We can also see from box plot that median of buying a loan has higher CCAvg approximately of 3.8 and for not buying loan the median of CC Avg is low around 1.6

Personal Loan vs ZIpcode-group¶

In [58]:
sns.set(style="whitegrid")  # Optional: Set a background style

ax = sns.countplot(x='ZIPCode_group', hue='Personal_Loan', data=df_zip_group) ## Plot count plot

# Adding labels to the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Adding labels to the plot
plt.title('Stacked Bar Plot of ZIPCode Groups and Personal Loan')
plt.xlabel('ZIPCode Groups')
plt.ylabel('Count')

# Adding a legend with labels
plt.legend(title='Personal Loan', labels=['No Loan', 'Loan'])
Out[58]:
<matplotlib.legend.Legend at 0x1bb6c106710>

Lets calculate percentage of people purchasing loan in those zipcode groups¶

In [59]:
print( "{} % of people from zipcode group 90 has purchased loan".format((67/(67+636))*100))  ## Printing calculations
print( "{} % of people from zipcode group 91 has purchased loan".format((55/(55+510))*100))
print( "{} % of people from zipcode group 92 has purchased loan".format((94/(94+894))*100))
print( "{} % of people from zipcode group 93 has purchased loan".format((43/(43+374))*100))
print( "{} % of people from zipcode group 94 has purchased loan".format((138/(138+1334))*100))
print( "{} % of people from zipcode group 95 has purchased loan".format((80/(80+735))*100))
print( "{} % of people from zipcode group 96 has purchased loan".format((3/(3+37))*100))
9.53058321479374 % of people from zipcode group 90 has purchased loan
9.734513274336283 % of people from zipcode group 91 has purchased loan
9.51417004048583 % of people from zipcode group 92 has purchased loan
10.311750599520384 % of people from zipcode group 93 has purchased loan
9.375 % of people from zipcode group 94 has purchased loan
9.815950920245399 % of people from zipcode group 95 has purchased loan
7.5 % of people from zipcode group 96 has purchased loan

Observation for ZIPCode group VS Personal Loan¶

  • From above stacked bar plot and the calculation we found out that the ZIPCode area group of 93 has the highest conversion rate , and where people comparitively are purchasing loans with approximately 10.31% people of that area.
  • The zipcode group where the least number of percentage of people has purchased loan is area 96 , with only 7.5% but its understandable as we only have 40 customer from that zipcode group area.
  • Hence the next lowest group after zipcode group 96 will be group 94 with 9.375% of people having purchased loan .

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [60]:
df.head() ## checking first 5 rows
Out[60]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [61]:
## Lets check if experience column has wrong entry  as we found out from univariate analysis and from description that it might have some 
## discrepancies
df[df["Experience"] < 0]["Experience"].unique()  
Out[61]:
array([-1, -2, -3], dtype=int64)

As we can see from above that we have some wrong entries,Treating the negative values of Experience: We assume that these negative signs here are data input errors, so we will replace them with positive signs¶

In [62]:
# Correcting the experience values
df["Experience"].replace(-1, 1, inplace=True)
df["Experience"].replace(-2, 2, inplace=True)
df["Experience"].replace(-3, 3, inplace=True)

Feature Engineering¶

In [63]:
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
Out[63]:
467

We can see we have 467 unique values in ZIP Code column , we cant convert them to dummy column as there are a lot of unique valiues , lets take only two letter of ZIP and then see how many unique values do we have¶

In [64]:
df["ZIPCode"] = df["ZIPCode"].astype(str)         ## converting to string
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    df["ZIPCode"].str[0:2].nunique(),                                ## printing unique values
)
Number of unique values if we take first two digits of ZIPCode:  7
In [65]:
df["ZIPCode"] = df["ZIPCode"].str[0:2]                   ## Replacing the values of Zip codes with first two numbers 

df["ZIPCode"] = df["ZIPCode"].astype("category")           ## converting to category type 
In [66]:
## Converting the data type of categorical features to 'category'
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")    #  convert the cat_cols to category

Outlier Detection¶

In [67]:
df.head() ## looking at first 5 rows 
Out[67]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91 4 1.0 2 0 0 0 0 0 1
In [68]:
Q1 = data.quantile(0.25)  # finding the 25th percentile 
Q3 = data.quantile(0.35)  # finding the 75th percentile 

IQR = Q3 - Q1               # Inter Quantile Range (75th perentile - 25th percentile)

lower = Q1 - 1.5 * IQR  # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper = Q3 + 1.5 * IQR
In [69]:
((df.select_dtypes(include=["float64", "int64"]) < lower)
    |(df.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100         ## printing % of outliers in all columns 
Out[69]:
Age                   50.06
CCAvg                 63.12
CD_Account             0.00
CreditCard             0.00
Education              0.00
Experience            57.28
Family                24.44
ID                    60.00
Income                73.82
Mortgage              30.76
Online                 0.00
Personal_Loan          0.00
Securities_Account     0.00
dtype: float64

Data Preparation¶

First of all lets drop ID Column as it is not important in our analysis context¶

In [70]:
df.drop(columns="ID",axis =1 , inplace = True) ## Dropping the ID column
In [71]:
X= df.drop(columns="Personal_Loan",axis =1 ) ## making independent variables in X
y= df["Personal_Loan"]       ## making dependent variable in Y as its our target variable 

Now as we have all important columns in our data set df , lets now do one hot encoding and dummy column prepartion where ever required¶

In [72]:
X = pd.get_dummies(X,columns=["ZIPCode","Education"],drop_first=True) ## one hot coding for ZIPCode and Education
X
Out[72]:
Age Experience Income Family CCAvg Mortgage Securities_Account CD_Account Online CreditCard ZIPCode_91 ZIPCode_92 ZIPCode_93 ZIPCode_94 ZIPCode_95 ZIPCode_96 Education_2 Education_3
0 25 1 49 4 1.6 0 1 0 0 0 1 0 0 0 0 0 0 0
1 45 19 34 3 1.5 0 1 0 0 0 0 0 0 0 0 0 0 0
2 39 15 11 1 1.0 0 0 0 0 0 0 0 0 1 0 0 0 0
3 35 9 100 1 2.7 0 0 0 0 0 0 0 0 1 0 0 1 0
4 35 8 45 4 1.0 0 0 0 0 1 1 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 29 3 40 1 1.9 0 0 0 1 0 0 1 0 0 0 0 0 1
4996 30 4 15 4 0.4 85 0 0 1 0 0 1 0 0 0 0 0 0
4997 63 39 24 2 0.3 0 0 0 0 0 0 0 1 0 0 0 0 1
4998 65 40 49 3 0.5 0 0 0 1 0 0 0 0 0 0 0 1 0
4999 28 4 83 3 0.8 0 0 0 1 1 0 1 0 0 0 0 0 0

5000 rows × 18 columns

Model Building¶

In [73]:
X_train, X_test , y_train , y_test = train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)  ## splitting test and train 
In [74]:
## Lets see how our test and train set are divided
print("Shape of training set", X_train.shape)
print("Shape of test set", X_test.shape)
print("-"*100)
print("Distribution of class in training data is ")
print(y_train.value_counts(normalize=True))
print("Distribution of class in test data is ")
print(y_test.value_counts(normalize=True))
Shape of training set (3500, 18)
Shape of test set (1500, 18)
----------------------------------------------------------------------------------------------------
Distribution of class in training data is 
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
Distribution of class in test data is 
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64

Model Evaluation Criterion¶

Model can make wrong predictions as:

  1. Predicting a customer will take the personal loan but in reality the customer will not take the personal loan - Loss of resources
  2. Predicting a customer will not take the personal loan but in reality the customer was going to take the personal loan - Loss of opportunity

Which case is more important?

  • Losing a potential customer by predicting that the customer will not be taking the personal loan but in reality the customer was going to take the personal loan.

How to reduce this loss i.e need to reduce False Negatives?

  • Bank would want Recall to be maximized, greater the Recall higher the chances of minimizing false negatives. Hence, the focus should be on increasing Recall or minimizing the false negatives.

Model Building¶

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

  • The model_performance_classification_sklearn function will be used to check the model performance of models.
  • The confusion_matrix_sklearnfunction will be used to plot confusion matrix.
In [75]:
def model_perf (model , predictors,target):        ## function initializing
    pred =  model.predict(predictors)              ## predicting values for dependent
    accuracy = accuracy_score(target,pred)         ## calculating accuracy score
    precision = precision_score(target,pred)         ## calculating Precision score
    recall= recall_score(target,pred)         ## calculating recall score
    f1 = f1_score(target,pred)         ## calculating f1 score score
    
    df_perf = pd.DataFrame({
        "Accuracy":accuracy,"Precision":precision,"Recall":recall,"f1-score":f1
    },index=[0])           ## making all score in a dataframe
    
    return df_perf
In [76]:
## lets make a function for confusion matrix
def conf_mat (model,predictors,target):                      ## Initializing the function
    pred=model.predict(predictors)                           ## predicting values for dependent using predictors
    cm=confusion_matrix(target,pred)                         ## calculating confusion matrix
    labels = np.asarray(
    [
        ["{0:0.0f}".format(item)+"\n{0:.2%}".format(item/cm.flatten().sum())]     ## labels for confusion matrix
        for item in cm.flatten()
    ]
    ).reshape(2,2)
     
    plt.figure(figsize=(6,5))               ## setting fig size for better visibility
    plt.title("Confusion Matrix")           ## setting title for plot
    sns.heatmap(cm,annot=labels,fmt="")    ## plotting heat map for confusion matrix
    plt.xlabel("Predicted")                 ## x label
    plt.ylabel("True")                       ## ylabel
In [77]:
## lets make function for visualizing tree
def show_me_tree(model, predictors):                  ## Initializing function
    feature_names=predictors.columns.tolist()            ## getting feature names 
    plt.figure(figsize=(20,30))                          ## setting fig size for better visibility
    out=tree.plot_tree(
        
        model,                                             ## plotting decision tree
        feature_names=feature_names,
        class_names=None,
        node_ids=False,
        filled=True,
        fontsize=9,
    )
    
    for o in out:                                           ## this will make sure that all arrows are drawn
        arrow=o.arrow_patch
        if arrow is not None:
            arrow.set_edgecolor("black")
            arrow.set_linewidth(1)
    plt.show()
In [78]:
## now lets make a function to plot feature importances
def show_me_feature_imp(model,predictors):             ## initializing function
    feature_names=predictors.columns.tolist()           ## Getting feature name 
    importances=model.feature_importances_              ## calculating feature importance
    indices=np.argsort(importances)                      ## sorting the importances
    
    fig,ax = plt.subplots(figsize=(10,8))               ## setting fig size for better visibility
    ax.barh(range(len(indices)),importances[indices],color="violet",align="center")    ## plotting bar graph
    ax.set_yticks(range(len(indices)),[feature_names[i] for i in indices])          ## setting y ticks
    ax.set_xlabel("Relative Importance")                     ## xlabel
    ax.set_ylabel("Feature Names")                           ## y label
    plt.show()
    

Building Decision Tree¶

In [79]:
model = DecisionTreeClassifier(criterion="gini",random_state=1)   ## initializing decision tree classifier
model.fit(X_train,y_train)                                        ## fitting model 
Out[79]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)

Checking model performance and confusion matrix on training data¶

In [80]:
basic_decision_tree_perf_train = model_perf(model,X_train,y_train)            ## calculating model performance
conf_mat(model ,X_train,y_train)                                              ## plotting confusion matrix
basic_decision_tree_perf_train          
Out[80]:
Accuracy Precision Recall f1-score
0 1.0 1.0 1.0 1.0

As we can see from above model performance , that this model is giving 100% for all the model evaluation parameters this means tree has outgrown to the last point where every leaf is pure, which means is overfitting , which is not a good sign and it must be tackled¶

Visualizing the decision tree¶

In [81]:
show_me_tree(model,X_train)        ## plotting decision tree
In [82]:
print(tree.export_text(model,feature_names=X_train.columns.tolist(),show_weights=True)) ## printing the rules for tree
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2519.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |--- Experience <= 12.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Experience >  12.50
|   |   |   |   |   |   |   |--- weights: [61.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Age <= 30.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  30.00
|   |   |   |   |   |   |   |   |--- Experience <= 19.50
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  19.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- Family <= 2.00
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  2.00
|   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- weights: [25.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- CCAvg <= 4.40
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- CCAvg >  4.40
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |--- Income >  92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |--- Age <= 61.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.35
|   |   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  4.35
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  61.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |--- Experience <= 36.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |--- Experience >  36.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- Mortgage <= 74.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Mortgage >  74.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- CCAvg >  4.45
|   |   |   |   |--- Mortgage <= 320.00
|   |   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |--- Age >  57.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Mortgage >  320.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|--- Income >  104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- Experience <= 4.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience >  4.50
|   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.85
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age >  33.00
|   |   |   |   |   |   |--- Experience <= 22.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- Experience >  22.50
|   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 80.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |--- Mortgage >  80.50
|   |   |   |   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 67.00] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 114.50
|   |   |   |--- Experience <= 3.50
|   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |--- Experience >  3.50
|   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- Income <= 109.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  109.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  2.90
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- ZIPCode_95 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_95 >  0.50
|   |   |   |   |   |   |   |--- Income <= 110.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income >  110.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Experience >  31.50
|   |   |   |   |   |--- Income <= 113.50
|   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  113.50
|   |   |   |   |   |   |--- Age <= 62.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  62.00
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- weights: [0.00, 155.00] class: 1

In [83]:
show_me_feature_imp(model,X_train)    ## plotting feature importance for above model

The top 5 important feature for this model from above plot we can see are ,¶

  • Income
  • Family
  • Education_2 (masters)
  • Education_3 (Advanced/Professional)
  • CCAvg

Checking performance on test data¶

In [84]:
basic_decision_tree_perf_test = model_perf(model,X_test,y_test)       ## calculating performance on test data
conf_mat(model,X_test,y_test)                                         ## plotting confusion matrix
basic_decision_tree_perf_test  
Out[84]:
Accuracy Precision Recall f1-score
0 0.982667 0.94697 0.868056 0.905797

We can see performance of our model from above confusion matrix and performance table ,¶

The concern we have here is as for us Type II error is problematic that means our recall score should be strong and high¶

Hence lets perform some model improvements using pre and post pruning and then see which model is the best¶

Model performance Improvements¶

Pre Pruning¶

In [85]:
estimator=DecisionTreeClassifier(random_state=1)                         ## initilaizing decision tree classifier 
parameters = {
    "max_depth":np.arange(6,18),                                        ## getting every desired possible paramter in dictionary
    "min_samples_leaf":np.arange(1,11),
    "max_leaf_nodes":np.arange(1,11)
}

grid_obj = GridSearchCV(estimator,parameters,cv=5,scoring=make_scorer(recall_score))  ## running grid search cv
grid_obj = grid_obj.fit(X_train,y_train)     ## fitting data in grid search cv

estimator = grid_obj.best_estimator_    ## finding best estimator
estimator.fit(X_train,y_train)     ## fitting model with best estimator
Out[85]:
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=5, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=5, random_state=1)

Checking performance on training data after pre pruning¶

In [86]:
decision_tree_tune_perf_train = model_perf(estimator,X_train,y_train)   ## calculating performance of pre pruned model
conf_mat(estimator,X_train,y_train)                                    ## plotting confusion matrix 
decision_tree_tune_perf_train 
Out[86]:
Accuracy Precision Recall f1-score
0 0.978286 0.871429 0.907738 0.889213

As we can see this time for our training data performance it didnt give all performance metrics answers as 100% that means now in this model after pre pruning we dont have overfitting and our tree leaves at the end are not entirely pure compared to the base model¶

Visualizing the Tree¶

In [87]:
show_me_tree(estimator,X_train) ## plotting decision tree for pre pruned model
In [88]:
# Text report showing the rules of a decision tree -

print(tree.export_text(estimator, feature_names=X_train.columns.tolist(), show_weights=True))
|--- Income <= 104.50
|   |--- weights: [2661.00, 31.00] class: 0
|--- Income >  104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [8.00, 61.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [9.00, 74.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [28.00, 170.00] class: 1

In [89]:
show_me_feature_imp(estimator,X_train)   ## ploting feature importance of pre pruned model

We can see from above plot that the features which were important in our base model are showing important here as well in pre pruned model , like income , Family , Education_2, Education_3 and etc.¶

Checking performance on test data after pre pruning¶

In [90]:
decision_tree_tune_perf_test = model_perf(estimator,X_test,y_test) ## calculating performance of test data in pre pruned model
conf_mat(estimator,X_test,y_test)               ## plotting confusion matrix of test data of pre pruned model
decision_tree_tune_perf_test
Out[90]:
Accuracy Precision Recall f1-score
0 0.960667 0.777778 0.826389 0.801347

we can see from above performance plot and confusion matrix , that our model did well in train data where it has recall of 90%, but here for test data the recall score went down to 82.6% , which is not a good sign as our base model , where we let the tree grow to the purest leaves , had a higher recall value¶

This means pre pruned model didnt help us significantly , lets use cost complexity pruning in next step and see if post pruning can help us¶

Cost Complexity Pruning¶

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

In [91]:
clf = DecisionTreeClassifier(random_state=1)                 ## initializing decision tree classifier
path = clf.cost_complexity_pruning_path(X_train,y_train)              ## getting path
ccp_alphas,impurities = path.ccp_alphas , path.impurities                  ## getting alpha and impurity values
In [92]:
pd.DataFrame(path)                         ## showing path as data frame
Out[92]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000184 0.000552
2 0.000229 0.001009
3 0.000245 0.001499
4 0.000257 0.002013
5 0.000262 0.002537
6 0.000262 0.003061
7 0.000286 0.003632
8 0.000343 0.003975
9 0.000371 0.004718
10 0.000429 0.005146
11 0.000457 0.005603
12 0.000467 0.006070
13 0.000470 0.009831
14 0.000488 0.010318
15 0.000495 0.011309
16 0.000508 0.011817
17 0.000583 0.012400
18 0.000653 0.013053
19 0.000667 0.015723
20 0.000989 0.016712
21 0.000994 0.017706
22 0.001000 0.018706
23 0.001195 0.021097
24 0.001625 0.022723
25 0.001782 0.024505
26 0.001908 0.026413
27 0.002335 0.028748
28 0.002970 0.031718
29 0.008156 0.039874
30 0.025722 0.091318
31 0.034690 0.126007
32 0.047561 0.173568
In [93]:
### Lets plot Total Impurities of leaves v/s effective alpha

fig,ax = plt.subplots(figsize=(10,4))   ## increase size of plot for better visibility 
ax.plot(ccp_alphas[:-1],impurities[:-1],marker="o",drawstyle="steps-post")   ## plotting alpha and impurities
ax.set_title("Total Impurities of leaves v/s effective alpha")                ## Setting title
ax.set_xlabel("Effective Alpha")                                              ## setting x label
ax.set_ylabel("Total Impurities of leaves")                                     ## setting y label
Out[93]:
Text(0, 0.5, 'Total Impurities of leaves')

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [94]:
clfs=[]              ## making empty list for classifiers
for ccp_alpha in ccp_alphas:                
    clf = DecisionTreeClassifier(random_state=1,ccp_alpha=ccp_alpha)               ## fitting for all alphas
    clf.fit(X_train,y_train)
    clfs.append(clf)                  ## append list of classifiers

print("Number of nodes in last tree are {} with alpha of {}".format(clfs[-1].tree_.node_count,ccp_alphas[-1]))
Number of nodes in last tree are 1 with alpha of 0.04756053380018527

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [95]:
clfs=clfs[:-1]                                         ## here we are neglecting the last one as it will trim all the tree
ccp_alphas =ccp_alphas[:-1]

node_count = [clf.tree_.node_count for clf in  clfs]                 ## getting node counts
max_depth = [clf.tree_.max_depth for clf in clfs]                     ## getting max depths
 
fig,ax = plt.subplots(2,1,figsize=(8,4))
ax[0].plot(ccp_alphas,node_count,marker="o",drawstyle="steps-post")       ## plotting nodes with alphas
ax[0].set_xlabel("Alpha")                                                    ## x label
ax[0].set_ylabel("Node Counts")                                              ## ylabel
ax[0].set_title("Alpha vs Node Counts")                                       ## setting title
ax[1].plot(ccp_alphas,max_depth,marker="o",drawstyle="steps-post")                ## plotting alphas with max depth
ax[1].set_xlabel("Alpha")                                                    ## x label
ax[1].set_ylabel("Max Depth")                                                 ## setting y lable
Out[95]:
Text(0, 0.5, 'Max Depth')
Now lets see recall values for all classifiers for both test and train data¶
In [96]:
recall_train=[]                 ## Setting up a empoty list of recall values for training data 
for clf in clfs:
    pred_train = clf.predict(X_train)           ## predicting from the predictors  using clf
    value_train = recall_score(y_train,pred_train)    ## calculating recall score
    recall_train.append(value_train)         ## appending values in list
recall_test=[]                      ## Setting up a empoty list of recall values for testing data
for clf in clfs:                  
    pred_test = clf.predict(X_test)                      ## predicting from the predictors  using clf
    value_test = recall_score(y_test,pred_test)         ## calculating recall score
    recall_test.append(value_test)             ## appending values in list
    
Lets plot all recall values of train and test data vs alpha values¶
In [97]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))

# Plotting for alpha and recall for training
ax.plot(ccp_alphas, recall_train, marker="o", drawstyle="steps-post", label="Training Data")

# Plotting for alpha and recall for testing
ax.plot(ccp_alphas, recall_test, marker="o", drawstyle="steps-post", label="Testing Data")

# Setting labels for axes
ax.set_ylabel("Recall Value")
ax.set_xlabel("Effective Alpha")

# Adding a legend
ax.legend()
Out[97]:
<matplotlib.legend.Legend at 0x1bb6c9ac6d0>
In [98]:
best_index = np.argmax(recall_test)       ## finding index of alpha where recall for test was best
best_alpha = clfs[best_index]                     ## getting best classifier
print("Hence our best alpha from cost complexity pruning comes out to be {}".format(best_alpha.ccp_alpha))  ## printing result
Hence our best alpha from cost complexity pruning comes out to be 0.0006674876847290641

Post -Pruning¶

In [99]:
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=best_alpha.ccp_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1         
)            ## initializing decision tree with correct class weight and the best alpha
estimator_2.fit(X_train, y_train)         ## fitting model for post pruning
Out[99]:
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
In [100]:
show_me_tree(estimator_2,X_train)  ## plotting decision tree for post pruned model
In [101]:
###  lets see  the rules for same tree
print(tree.export_text(estimator_2,feature_names=X_train.columns.tolist(),show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [369.60, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 81.50
|   |   |   |--- Age <= 36.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- weights: [1.65, 0.00] class: 0
|   |   |   |--- Age >  36.50
|   |   |   |   |--- weights: [9.15, 0.00] class: 0
|   |   |--- Income >  81.50
|   |   |   |--- CCAvg <= 4.40
|   |   |   |   |--- Age <= 46.00
|   |   |   |   |   |--- Income <= 90.50
|   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |--- Income >  90.50
|   |   |   |   |   |   |--- weights: [0.60, 1.70] class: 1
|   |   |   |   |--- Age >  46.00
|   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.90, 3.40] class: 1
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |--- Mortgage <= 154.00
|   |   |   |   |   |   |   |--- weights: [0.45, 7.65] class: 1
|   |   |   |   |   |   |--- Mortgage >  154.00
|   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |--- CCAvg >  4.40
|   |   |   |   |--- weights: [2.40, 0.00] class: 0
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 101.50
|   |   |   |   |   |--- CCAvg <= 2.95
|   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.95
|   |   |   |   |   |   |--- weights: [0.15, 2.55] class: 1
|   |   |   |   |--- Income >  101.50
|   |   |   |   |   |--- weights: [71.40, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |--- Income >  103.50
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.85
|   |   |   |   |   |   |   |--- weights: [0.00, 5.95] class: 1
|   |   |   |   |   |--- Income >  116.50
|   |   |   |   |   |   |--- weights: [0.00, 45.90] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- Age <= 34.50
|   |   |   |   |   |   |--- Experience <= 2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- Experience >  2.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |--- Age >  34.50
|   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.27
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  3.27
|   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |--- weights: [0.15, 5.10] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 56.95] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 112.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [4.05, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |--- Income <= 111.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.40] class: 1
|   |   |   |   |   |   |   |--- Income >  111.50
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |--- weights: [0.30, 8.50] class: 1
|   |   |   |   |--- Age >  59.50
|   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |--- Income >  112.50
|   |   |   |--- weights: [0.90, 137.70] class: 1

Checking performance¶

In [102]:
post_pruning_train = model_perf(estimator_2,X_train,y_train)   ## calculing performance of training data after post pruning model
conf_mat(estimator_2,X_train,y_train) ##  create confusion matrix for train data
post_pruning_train
Out[102]:
Accuracy Precision Recall f1-score
0 0.993429 0.935933 1.0 0.966906

We can see from above performance table and confusion matrix for our post pruned model on train data has performed very well , with a recall of 1¶

In [103]:
show_me_feature_imp(estimator_2,X_train)   ## plotting feature importance of post pruned model

We can see again we have got the same feature as important that we were getting from base and pre pruned model¶

Checking Performance on Test Data after Post Pruning¶

In [104]:
## lets see performance of test data after post pruning
post_pruning_test = model_perf(estimator_2,X_test,y_test)    ## calculating performance of test data of post pruned model
conf_mat(estimator_2,X_test,y_test)                      ## plotting confusion matrix
post_pruning_test
Out[104]:
Accuracy Precision Recall f1-score
0 0.98 0.88 0.916667 0.897959

Here we can see from our performance table and confusion matrix that our recall score has increased to 91% for post pruned model on test data , which is a good sign as it is the highest recall on test data that we have got compared to base and pre pruned model¶

Model Comparison and Final Model Selection¶

In [105]:
## comparision
model_compare = pd.concat([basic_decision_tree_perf_train.T,basic_decision_tree_perf_test.T,decision_tree_tune_perf_train.T,decision_tree_tune_perf_test.T,post_pruning_train.T,post_pruning_test.T],axis=1)
model_compare.columns = ["Basic Train ","Basic Test","Pre Pruning Train","Pre Pruning Test","Post Pruning Train","Post Pruning Test"]
model_compare
Out[105]:
Basic Train Basic Test Pre Pruning Train Pre Pruning Test Post Pruning Train Post Pruning Test
Accuracy 1.0 0.982667 0.978286 0.960667 0.993429 0.980000
Precision 1.0 0.946970 0.871429 0.777778 0.935933 0.880000
Recall 1.0 0.868056 0.907738 0.826389 1.000000 0.916667
f1-score 1.0 0.905797 0.889213 0.801347 0.966906 0.897959

we can see from above comparison that our post pruning model has performed comparitively better than any other model, where for training set recall score is 1 and for test set recall is 91.67% , which is higher than any other test-train set.¶

Hence we will go with our post pruning model¶

In [106]:
## Hence below is our final model 
best_model = estimator_2
best_model
Out[106]:
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)

Now lets see tree and its rules¶

In [107]:
show_me_tree(best_model,X_train)
In [108]:
###  lets see  the rules for same tree
print(tree.export_text(best_model,feature_names=X_train.columns.tolist(),show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [369.60, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 81.50
|   |   |   |--- Age <= 36.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- weights: [1.65, 0.00] class: 0
|   |   |   |--- Age >  36.50
|   |   |   |   |--- weights: [9.15, 0.00] class: 0
|   |   |--- Income >  81.50
|   |   |   |--- CCAvg <= 4.40
|   |   |   |   |--- Age <= 46.00
|   |   |   |   |   |--- Income <= 90.50
|   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |--- Income >  90.50
|   |   |   |   |   |   |--- weights: [0.60, 1.70] class: 1
|   |   |   |   |--- Age >  46.00
|   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.90, 3.40] class: 1
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |--- Mortgage <= 154.00
|   |   |   |   |   |   |   |--- weights: [0.45, 7.65] class: 1
|   |   |   |   |   |   |--- Mortgage >  154.00
|   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |--- CCAvg >  4.40
|   |   |   |   |--- weights: [2.40, 0.00] class: 0
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 101.50
|   |   |   |   |   |--- CCAvg <= 2.95
|   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.95
|   |   |   |   |   |   |--- weights: [0.15, 2.55] class: 1
|   |   |   |   |--- Income >  101.50
|   |   |   |   |   |--- weights: [71.40, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |--- Income >  103.50
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.85
|   |   |   |   |   |   |   |--- weights: [0.00, 5.95] class: 1
|   |   |   |   |   |--- Income >  116.50
|   |   |   |   |   |   |--- weights: [0.00, 45.90] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- Age <= 34.50
|   |   |   |   |   |   |--- Experience <= 2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- Experience >  2.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |--- Age >  34.50
|   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.27
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  3.27
|   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |--- weights: [0.15, 5.10] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 56.95] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 112.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [4.05, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |--- Income <= 111.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.40] class: 1
|   |   |   |   |   |   |   |--- Income >  111.50
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |--- weights: [0.30, 8.50] class: 1
|   |   |   |   |--- Age >  59.50
|   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |--- Income >  112.50
|   |   |   |--- weights: [0.90, 137.70] class: 1

In [109]:
## lets see what features are most important for this model 
show_me_feature_imp(best_model,X_train)

As we can see from above plot , for best model following are top 9 most important feature (in direction of most important to low importance)¶

  • Income
  • Education_2 (Masters Degree)
  • Education_3 (Advance / Professional Degree)
  • Family
  • CCAvg
  • Age
  • Experience
  • ZIPCode_94 ( all zips starting with number 94)
  • Mortagage

Actionable Insights and Business Recommendations¶

1. Targeted Marketing for Higher Income Groups:¶

Insights:¶
  • The attribute with the strongest correlation to loan purchases is "Income."
Recommendations:¶
  • Implement targeted marketing campaigns tailored to customers with higher incomes, emphasizing how personal loans can enhance their financial portfolios.

2. Educational Level Alignment:¶

Insights¶
  • Customers with advanced or professional education levels (level 3) show a higher interest in buying loans.
Recommendations¶
  • Design marketing materials that resonate with the aspirations of individuals with advanced education, positioning personal loans as tools for professional and financial growth.

3. Utilize Online Channels for Promotion:¶

Insights¶
  • A significant portion of customers use online banking facilities.
Recommendations¶
  • Leverage online channels for targeted promotional activities, providing easy access to loan information and application processes through the bank's online platforms.

4. Bundle Loan Offerings with Credit Card Benefits:¶

Insights¶
  • A substantial number of customers own credit cards issued by other banks.
Recommendations¶
  • Develop campaigns that bundle personal loan offerings with exclusive credit card benefits, enticing customers to consolidate their financial services with AllLife Bank.

5. Promote Additional Financial Services for Securities and CD Account Holders:¶

Insights¶
  • Customers with securities or CD accounts demonstrate interest in broader financial services.
Recommendations¶
  • Launch targeted cross-selling initiatives, offering additional financial products and services to customers with securities and CD accounts.

6. Family-Focused Marketing:¶

Insights¶
  • Families with a size of 3 exhibit the highest interest in loan purchases.
Recommendations¶
  • Tailor marketing strategies to highlight personal loans as solutions for family needs and financial aspirations, creating campaigns that resonate with families.

7. Engage Customers in the Age Range 26-65:¶

Insights¶
  • Individuals aged 26 to 65 show increased interest in loan purchases.
Recommendations¶
  • Craft marketing messages that align with the financial goals and life stages of customers in the age range of 26 to 65, emphasizing the benefits of personal loans during these pivotal years.

8. Address Data Discrepancies and Outliers:¶

Insights¶
  • Certain attributes, such as "Experience" and "Mortgage," exhibit anomalies and outliers.
Recommendations¶
  • Prioritize data cleaning and preprocessing to address discrepancies and outliers, ensuring the accuracy and reliability of insights derived from the dataset.

9. Enhance Campaign Targeting Techniques:¶

Insights¶
  • Previous campaigns with targeted marketing showed a healthy conversion rate.
Recommendations¶
  • Continuously refine targeting techniques by leveraging data-driven insights, adopting machine learning models to predict potential loan buyers, and ensuring that campaigns are dynamic and responsive to customer behaviors.

10. Encourage CD Account Holders to Explore Loan Options:¶

Insights¶
  • Customers with CD accounts exhibit a high interest in buying loans.
Recommendations¶
  • Design specific campaigns addressing CD account holders, showcasing the advantages of diversifying their financial portfolio with personal loans.

Implementing these recommendations can enhance the effectiveness of marketing strategies, broaden the customer base for personal loans, and contribute to the overall growth and profitability of AllLife Bank. Regularly monitor campaign performance and customer feedback to iterate and optimize future marketing initiatives.¶